First, I will download 50 videos per subset; the more videos we have, the broader the analysis can be. It is also essential to point out that I will compare the subsets through the distributions of their colour-based features, because even frames within a single subset can differ greatly. For example, cinema has seen major technological advances over the last 20 years, and movies look different at the start and at the end of that period. Moreover, colour movies only started to appear in the 1940s and 1950s, so the first subset should contain only black-and-white movies.
Considering the vast dimensionality of the data, I will start with scene detection. For this method I have decided to set the threshold to 10, which produces more data points (frames). After collecting the middle frame of every scene, so that each subset contains its corresponding frames, I will move on to the analysis itself. Since I plan to compare the distributions of the data across subsets, it is not necessary to have the same number of frames in each subset. Different frame counts are in fact very likely, because trailer length and shooting style changed over time, so the number of scenes differs too.
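As a minimal sketch of the middle-frame rule described above (the real code below works with scenedetect timecode objects; plain frame numbers are used here purely for illustration):

```python
def middle_frame(start_frame, end_frame):
    """Index of the frame halfway through a scene, given its start and end frames."""
    duration = end_frame - start_frame  # scene length in frames
    return start_frame + duration // 2  # start plus half the duration

# A scene spanning frames 100-160 is represented by frame 130.
print(middle_frame(100, 160))  # -> 130
```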
I think the best way to analyse the subsets from different periods is to compare the distributions of their colour features. I believe that the image quality of trailers and movies has changed over time: trailers from recent years should have better image quality, and thus higher saturation and a wider range of hues. I will visualise the data and compare the subsets. We also have to take into account that the first subset contains black-and-white movies, since colour film only started to spread around 1940. In addition, I will look at the distribution of lightness and at the distribution of the median RGB colours of each frame, simply out of curiosity. The last feature I will use is the aspect ratio of the video. It would be very interesting to compare how width and height evolved across the three time periods; for example, television screens are wider now than they used to be.
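The reasoning about saturation and hue can be illustrated with the standard RGB-to-HSV conversion (here via the stdlib `colorsys`, as a stand-in for the OpenCV conversion used later): a grey pixel, the typical case for black-and-white frames, has zero saturation and a hue conventionally reported as 0.

```python
import colorsys

# A grey pixel has equal R, G and B values, so its saturation is 0 and its
# hue is undefined (reported as 0); only the value (lightness) channel
# carries information. This is the pattern expected for the 1920-1940 subset.
h, s, v = colorsys.rgb_to_hsv(0.5, 0.5, 0.5)  # mid-grey pixel
print(h, s, v)  # -> 0.0 0.0 0.5

# A saturated red pixel, by contrast, has full saturation.
h, s, v = colorsys.rgb_to_hsv(1.0, 0.0, 0.0)
print(h, s, v)  # -> 0.0 1.0 1.0
```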
Import libraries:
import pandas as pd
import os
import wget
from tqdm.auto import tqdm
import cv2
import numpy as np
from scenedetect import VideoManager
from scenedetect import SceneManager
from scenedetect.detectors import ContentDetector
import colorgram
from matplotlib.colors import to_hex
from tensorflow.keras.preprocessing import image
from PIL import Image
import plotly.express as px
from matplotlib import pyplot as plt
import plotly.offline as pyo
pyo.init_notebook_mode()
Make subsets:
movies = pd.read_csv("trailers.csv") #load csv
sub2040 = movies.loc[(movies.year >= 1920) & (movies.year <= 1940), ] #make subset 1920 - 1940
sub2040 = sub2040.sample(50, random_state = 42) #sample 50 trailers
sub6080 = movies.loc[(movies.year >= 1960) & (movies.year <= 1980), ] #make subset 1960 - 1980
sub6080 = sub6080.sample(50, random_state = 42) #sample 50 trailers
sub0020 = movies.loc[(movies.year >= 2000) & (movies.year <= 2020), ] #make subset 2000 - 2020
sub0020 = sub0020.sample(50, random_state = 42) #sample 50 trailers
Define useful functions:
def download_videos(subset, output_folder, subset_folder):
    """
    Downloads videos based on data in dataframe
    Parameters
    ----------
    subset : pandas dataframe
        dataframe with info about videos to be downloaded
    output_folder : string
        string with the name of the main subfolder
    subset_folder : string
        string with the name of the subset folder
    Returns
    ----------
    video_paths : list
        list of video paths
    """
    if not os.path.exists(output_folder): #if the output folder does not exist
        os.mkdir(output_folder) #create folder
    if not os.path.exists(os.path.join(output_folder, subset_folder)): #if the subset folder does not exist
        os.mkdir(os.path.join(output_folder, subset_folder)) #create folder
    video_paths = [] #initialize list
    for video in tqdm(subset.itertuples(), total=len(subset)): #loop through videos to be downloaded
        video_url = video.url #get video url
        output_path = os.path.join(output_folder, subset_folder, video.trailer_title + '.mp4') #build output path
        if not os.path.exists(output_path): #skip videos that are already downloaded
            wget.download(video_url, out=output_path) #download the video
        video_paths.append(output_path) #collect the video path
    return video_paths
def find_scenes(video_path, threshold):
    """
    Finds scenes in video
    Parameters
    ----------
    video_path : string
        path of the video
    threshold : int
        content-change threshold for the detector
    Returns
    ----------
    list of detected scenes as (start, end) timecode pairs
    """
    video_manager = VideoManager([video_path]) #initialize video manager
    scene_manager = SceneManager() #initialize scene manager
    scene_manager.add_detector(
        ContentDetector(threshold=threshold)) #add detector
    base_timecode = video_manager.get_base_timecode() #get base timecode
    video_manager.set_downscale_factor() #downscale for faster processing
    video_manager.start()
    try:
        scene_manager.detect_scenes(frame_source=video_manager, show_progress=False)
        return scene_manager.get_scene_list(base_timecode)
    finally:
        video_manager.release() #always release the video resources
def get_frames_for_set(paths):
    """
    Gets the middle frame of every scene for a set of video paths
    Parameters
    ----------
    paths : list
        list of video paths
    Returns
    ----------
    list of frames
    """
    frames = [] #initialize list
    for filename in paths: #loop through all paths
        scene_list = find_scenes(filename, threshold=10) #get scene list
        cap = cv2.VideoCapture(filename) #initialize video capture
        for start_time, end_time in scene_list: #loop through all scenes
            duration = end_time - start_time #duration of the scene
            frame_number = start_time.get_frames() + int(duration.get_frames() / 2) #index of the middle frame of the scene
            cap.set(cv2.CAP_PROP_POS_FRAMES, frame_number) #seek to the middle frame
            ret, frame = cap.read()
            if not ret: #skip frames which could not be read
                continue
            frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB) #OpenCV reads BGR, convert to RGB
            frames.append(frame) #add frame to the list
        cap.release() #release the video capture
    return frames
def load_image_from_path(image_path, target_size=None, color_mode='rgb'):
    """
    Loads image from path
    Parameters
    ----------
    image_path : string
        path of the image
    target_size : tuple
        target size of the image
    color_mode : string
        rgb or grayscale
    Returns
    ----------
    loaded image as a numpy array
    """
    pil_image = image.load_img(image_path,
                               target_size=target_size,
                               color_mode=color_mode) #load image
    return image.img_to_array(pil_image) #convert to numpy array and return
def save_frames(frames, sub):
    """
    Saves frames as jpgs into a specific directory
    Parameters
    ----------
    frames : list
        list of frames (RGB) to be saved
    sub : string
        string which specifies the directory
    """
    s = 'scenes' + sub + '/' #get folder path
    if not os.path.exists(s): #if folder does not exist, create it
        os.mkdir(s)
    for i, frame in enumerate(frames): #loop through frames
        cv2.imwrite(s + 'frame_{}.jpg'.format(i),
                    cv2.cvtColor(frame, cv2.COLOR_RGB2BGR)) #frames are stored as RGB, but imwrite expects BGR
def get_subset_info(key):
    """
    Gets all the info for each subset
    Parameters
    ----------
    key : string
        string which specifies the directory
    Returns
    ----------
    colours : list of lists
        list of lists with median rgb values per image
    ratios : list
        list of aspect-ratio values per image
    saturation : list
        list of saturation values per image
    hue : list
        list of hue values per image
    lightness : list
        list of lightness values per image
    """
    directory = os.fsencode(key) #encode the directory path
    colours = [] #initialize list for rgb median values
    ratios = [] #initialize list for ratios
    saturation = [] #initialize list for saturation
    hue = [] #initialize list for hue
    lightness = [] #initialize list for lightness
    for file in tqdm(os.listdir(directory)): #loop through all the files
        filename = os.fsdecode(file) #get filename
        if filename.startswith('.'): #skip hidden files (mainly ipynb checkpoints)
            continue
        color_image = load_image_from_path(key + filename,
                                           color_mode='rgb') #load the image
        median_r, median_g, median_b = np.median(color_image, axis=(0, 1)) #get median rgb values of the image
        colours.append([median_r, median_g, median_b]) #append to the list
        width = color_image.shape[1] #get width of the image
        height = color_image.shape[0] #get height of the image
        aspect_ratio = width / height #compute aspect ratio
        ratios.append(aspect_ratio) #append to the list
        color_image_hsv = cv2.cvtColor(color_image.astype(np.uint8),
                                       cv2.COLOR_RGB2HSV) #cvtColor needs uint8 for the 0-255 range
        median_h, median_s, median_v = np.median(color_image_hsv, axis=(0, 1)) #get medians of hue, saturation and value (lightness)
        hue.append(median_h) #append hue value to the list
        lightness.append(median_v) #append value (lightness) to the list
        saturation.append(median_s) #append saturation value to the list
    return colours, ratios, saturation, hue, lightness
def visualize_boxplots(data, main, x, y):
    """
    Visualizes boxplots of data
    Parameters
    ----------
    data : dictionary
        dictionary filled with data
    main : string
        name of the graph
    x : string
        name of the x axis
    y : string
        name of the y axis
    Returns
    ----------
    plotly figure with the boxplots
    """
    df = pd.DataFrame(dict(stamp = np.concatenate((["1920-1940"]*len(data["1"]), ["1960-1980"]*len(data["2"]), ["2000-2020"]*len(data["3"]))),
                           vals = np.concatenate((data["1"], data["2"], data["3"])))) #create dataframe
    fig = px.box(df, x="stamp", y="vals") #create boxplot
    fig.update_layout(title = main, #update layout
                      xaxis_title = x,
                      yaxis_title = y)
    return fig
def convert_array_of_arrays(array, index):
    """
    Gets the list of one specific colour channel from an rgb list of lists
    Parameters
    ----------
    array : list
        contains rgb values, list of lists
    index : int
        0 - r value, 1 - g value, 2 - b value
    Returns
    ----------
    list with the values of the chosen channel
    """
    return [rgb[index] for rgb in array] #pick the chosen channel from every rgb triple
def visualize_rgb(data, main, x, y, index):
    """
    Visualizes boxplots of one rgb channel
    Parameters
    ----------
    data : dictionary
        dictionary filled with rgb data
    main : string
        name of the graph
    x : string
        name of the x axis
    y : string
        name of the y axis
    index : int
        specifies the colour channel: 0 - r, 1 - g, 2 - b
    """
    df = pd.DataFrame(dict(subset = np.concatenate((["1920-1940"]*len(data["1"]), ["1960-1980"]*len(data["2"]), ["2000-2020"]*len(data["3"]))),
                           col = np.concatenate((convert_array_of_arrays(data["1"], index), convert_array_of_arrays(data["2"], index), convert_array_of_arrays(data["3"], index)))))
    #construct a dataframe suitable for plotting
    fig = px.box(df, x="subset", y="col") #create box plot
    fig.update_layout(title = main, #update fig layout
                      xaxis_title = x,
                      yaxis_title = y)
    fig.show() #show fig
Download videos:
videos = {}
videos[1] = download_videos(sub2040,
                            output_folder='videos', # videos will be in the folder 'videos'
                            subset_folder='1920_1940') # in the subfolder '1920_1940'
videos[2] = download_videos(sub6080,
                            output_folder='videos',
                            subset_folder='1960_1980') # in the subfolder '1960_1980'
videos[3] = download_videos(sub0020,
                            output_folder='videos',
                            subset_folder='2000_2020') # in the subfolder '2000_2020'
Prepare frames from video:
frames = {} #initialize dictionary
frames["1"] = get_frames_for_set(videos[1]) #get frames for first subset
frames["2"] = get_frames_for_set(videos[2]) #get frames for second subset
frames["3"] = get_frames_for_set(videos[3]) #get frames for third subset
Save frames and path of directories:
path = {} #initialize dictionary
for key in frames: #loop through dictionary
    save_frames(frames[key], key) #save all the frames
    path[key] = 'scenes' + key + '/' #store the directory of the frames
Get all the info about subset:
medians = {} #initialize dictionary of medians
saturation = {} #initialize dictionary of saturation
hue = {} #initialize dictionary of hue
lightness = {} #initialize dictionary of lightness
ratios = {} #initialize dictionary of ratios
for key in path: #loop through the paths of the subsets
    medians[key], ratios[key], saturation[key], hue[key], lightness[key] = get_subset_info(path[key]) #get the info of the subset
(The three subsets yielded 3625, 6636 and 6196 frames, respectively.)
Visualize ratios:
ratios_graph = visualize_boxplots(ratios, "Graph of ratios distribution", "subset", "ratios")
ratios_graph.show()
Visualize saturation:
sat_graph = visualize_boxplots(saturation, "Graph of saturation distribution", "subset", "saturations")
sat_graph.show()
Visualize hue:
visualize_boxplots(hue, "Graph of hue distribution", "subset", "hue").show()
Visualize lightness:
visualize_boxplots(lightness, "Graph of lightness distribution", "subset", "lightness").show()
Visualize medians of red color:
visualize_rgb(medians, "Medians of reds", "subset", "r colour", 0)
Visualize medians of green color:
visualize_rgb(medians, "Medians of greens", "subset", "g colour", 1)
Visualize medians of blue color:
visualize_rgb(medians, "Medians of blues", "subset", "b colour", 2)
Looking at the distribution of ratios in the corresponding plot, we can see that during 1920-1940 the ratio was mostly around 1.3. In the second subset there are occasionally even square ratios (1.0). In 2000-2020, however, the frame is always wide. The saturation plot also shows interesting results: saturation is very low for the oldest dataset. This is likely a consequence of black-and-white film; at the same time it suggests that, as technology evolved, pictures gained richer, more saturated colours.
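For reference, the ratio values in the plot correspond to well-known screen formats (these are the standard format definitions, not values taken from the data itself):

```python
# Standard aspect ratios for comparison with the boxplot values.
formats = {
    "4:3 (early film and TV)": 4 / 3,     # ~1.33, dominant in 1920-1940
    "1:1 (square)": 1 / 1,                # occasionally present in 1960-1980
    "16:9 (modern widescreen)": 16 / 9,   # ~1.78, typical of 2000-2020
}
for name, ratio in formats.items():
    print(f"{name}: {ratio:.2f}")
```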
Surprisingly, the distributions of the medians of red, green and blue show no big differences between the subsets. The distribution of hue, however, has its median at 0, which supports my hypothesis that the first and oldest subset contains mainly black-and-white pictures.
To conclude, I believe that colour-wise the subset from 1920 to 1940 differs greatly from both other subsets. The two newer subsets, however, can be considered similar in this colour analysis, as the distributions of their features are nearly the same. Yet if one watched the trailers, a difference would be visible. In the future, a different method could therefore be tried to compare the frames, to see whether the 1960-1980 and 2000-2020 datasets can be distinguished.
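As one hypothetical follow-up (not implemented above), the subsets could be compared by their full colour histograms rather than per-frame medians, which would preserve more of each frame's colour distribution. A rough numpy-only sketch on synthetic stand-in frames (the function names and the data here are illustrative, not from the analysis):

```python
import numpy as np

def colour_histogram(frames, bins=8):
    """Average normalised per-channel histogram over a list of RGB frames."""
    hists = []
    for frame in frames:
        channels = [np.histogram(frame[..., c], bins=bins, range=(0, 256))[0]
                    for c in range(3)]  # one histogram per colour channel
        hist = np.concatenate(channels).astype(float)
        hists.append(hist / hist.sum())  # normalise so each frame counts equally
    return np.mean(hists, axis=0)

def histogram_distance(frames_a, frames_b):
    """L1 distance between the average histograms of two frame sets."""
    return float(np.abs(colour_histogram(frames_a) - colour_histogram(frames_b)).sum())

# Synthetic frames: one bright set and one dark set.
rng = np.random.default_rng(42)
bright = [rng.integers(128, 256, (8, 8, 3), dtype=np.uint8) for _ in range(3)]
dark = [rng.integers(0, 128, (8, 8, 3), dtype=np.uint8) for _ in range(3)]
print(histogram_distance(bright, dark))    # large: the sets occupy different halves of the range
print(histogram_distance(bright, bright))  # -> 0.0, identical sets
```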